Marks: 60
The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to perform PCA
from sklearn.decomposition import PCA
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
#import the data set
stock = pd.read_csv('/content/drive/MyDrive/DSBA Certificate/Unsupervised Learning/stock_data.csv')
stock.shape
(340, 15)
There are 340 rows and 15 columns in the data
stock.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
stock.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
stock.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ticker Symbol | 340 | 340 | AAL | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Security | 340 | 340 | American Airlines Group | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sector | 340 | 11 | Industrials | 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Current Price | 340.0 | NaN | NaN | NaN | 80.862345 | 98.055086 | 4.5 | 38.555 | 59.705 | 92.880001 | 1274.949951 |
| Price Change | 340.0 | NaN | NaN | NaN | 4.078194 | 12.006338 | -47.129693 | -0.939484 | 4.819505 | 10.695493 | 55.051683 |
| Volatility | 340.0 | NaN | NaN | NaN | 1.525976 | 0.591798 | 0.733163 | 1.134878 | 1.385593 | 1.695549 | 4.580042 |
| ROE | 340.0 | NaN | NaN | NaN | 39.597059 | 96.547538 | 1.0 | 9.75 | 15.0 | 27.0 | 917.0 |
| Cash Ratio | 340.0 | NaN | NaN | NaN | 70.023529 | 90.421331 | 0.0 | 18.0 | 47.0 | 99.0 | 958.0 |
| Net Cash Flow | 340.0 | NaN | NaN | NaN | 55537620.588235 | 1946365312.175789 | -11208000000.0 | -193906500.0 | 2098000.0 | 169810750.0 | 20764000000.0 |
| Net Income | 340.0 | NaN | NaN | NaN | 1494384602.941176 | 3940150279.327936 | -23528000000.0 | 352301250.0 | 707336000.0 | 1899000000.0 | 24442000000.0 |
| Earnings Per Share | 340.0 | NaN | NaN | NaN | 2.776662 | 6.587779 | -61.2 | 1.5575 | 2.895 | 4.62 | 50.09 |
| Estimated Shares Outstanding | 340.0 | NaN | NaN | NaN | 577028337.75403 | 845849595.417695 | 27672156.86 | 158848216.1 | 309675137.8 | 573117457.325 | 6159292035.0 |
| P/E Ratio | 340.0 | NaN | NaN | NaN | 32.612563 | 44.348731 | 2.935451 | 15.044653 | 20.819876 | 31.764755 | 528.039074 |
| P/B Ratio | 340.0 | NaN | NaN | NaN | -1.718249 | 13.966912 | -76.119077 | -4.352056 | -1.06717 | 3.917066 | 129.064585 |
#check for missing data
stock.isnull().sum()
| 0 | |
|---|---|
| Ticker Symbol | 0 |
| Security | 0 |
| GICS Sector | 0 |
| GICS Sub Industry | 0 |
| Current Price | 0 |
| Price Change | 0 |
| Volatility | 0 |
| ROE | 0 |
| Cash Ratio | 0 |
| Net Cash Flow | 0 |
| Net Income | 0 |
| Earnings Per Share | 0 |
| Estimated Shares Outstanding | 0 |
| P/E Ratio | 0 |
| P/B Ratio | 0 |
There are no missing entries in the data set.
#check for duplicates
stock.duplicated().sum()
0
There are no duplicated entries in the dataset!
#copy the data
data = stock.copy()
Questions:
#apply shared codes for created plots to visualize data
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=True, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
labeled_barplot(data, 'GICS Sector', perc=True)
The top economic sectors are Industrials (15.6%), Financials (14.4%), and Healthcare/Consumer Discreationary (both 11.8%). These sectors make up approx. 53.6% of the dataset. The least common sector is Telecommunications Services (1.5% of data).
labeled_barplot(data, 'GICS Sub Industry', perc=True)
# understand distributions of numerical columns
num_cols = data.select_dtypes(include=np.number).columns.tolist()
for x in num_cols:
histogram_boxplot(data, x)
#heatmap to see correlation amongst numeric variables
plt.figure(figsize=(15, 7))
sns.heatmap(
data[num_cols].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
Strong Positive Correlation:
Strong Negative Correlations:
There are no variables with correlation more than 0.70 / less than -0.70. We will not drop any correlated variables to change dimensionality.
sns.pairplot(data[num_cols], diag_kind='kde')
plt.show()
Checking the peaks of the KDE plots of different variables, we may expect to apply 3-4 clusters when working with K-Means Clustering.
#question 2
#compare GICS Sector and Price Change
plt.figure(figsize=(15, 7))
sns.boxplot(x=data["GICS Sector"], y=data['Price Change'])
plt.show()
Response to Question 2:
#question 4
##compare GICS Sector and Cash Ratio
plt.figure(figsize=(15, 7))
sns.boxplot(x=data["GICS Sector"], y=data['Cash Ratio'])
plt.show()
Response to Question 4:
#question 5
#compare P/E and GICS Sectors
#compare GICS Sector and Price Change
plt.figure(figsize=(15, 7))
sns.boxplot(x=data["GICS Sector"], y=data['P/E Ratio'])
plt.show()
Response to Question 5:
#copy data
df = data.copy()
df.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
#scale data
scaler = StandardScaler()
subset = df[num_cols].copy()
subset_scaled = scaler.fit_transform(subset)
#create scaled dataframe
subset_scaled_frame = pd.DataFrame(subset_scaled, columns=subset.columns)
subset_scaled_frame.head()
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
plt.figure(figsize=(15, 12))
for i, variable in enumerate(num_cols):
plt.subplot(3, 4, i + 1)
plt.boxplot(subset_scaled_frame[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
km_df = subset_scaled_frame.copy()
#investigate clusters and elbow plot
#set range based on number of columns
clusters=range(1,15)
meanDistortions=[]
for k in clusters:
model = KMeans(n_clusters=k)
model.fit(km_df)
prediction = model.predict(km_df)
distortion = (
sum(np.min(cdist(subset_scaled_frame, model.cluster_centers_, "euclidean"), axis=1))
/ subset_scaled_frame.shape[0]
)
meanDistortions.append(distortion)
#plot elbow
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
plt.show()
There seem to be elbows at k=5 and k=13. We can continue by using the KElbowVisualizer.
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), timings=True)
visualizer.fit(km_df) # fit the data to the visualizer
visualizer.show() # finalize and render figure
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
#gather scores
#plot to compare
sil_score = []
cluster_list = list(range(2, 15))
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters=n_clusters, random_state=1)
preds = clusterer.fit_predict((km_df))
score = silhouette_score(km_df, preds)
sil_score.append(score)
print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
For n_clusters = 2, silhouette score is 0.43969639509980457 For n_clusters = 3, silhouette score is 0.4644405674779404 For n_clusters = 4, silhouette score is 0.45434371948348606 For n_clusters = 5, silhouette score is 0.43169988466492354 For n_clusters = 6, silhouette score is 0.36698524210914063 For n_clusters = 7, silhouette score is 0.10587349598463779 For n_clusters = 8, silhouette score is 0.40021596082234684 For n_clusters = 9, silhouette score is 0.4051539505522535 For n_clusters = 10, silhouette score is 0.1147132918355368 For n_clusters = 11, silhouette score is 0.16268826704862685 For n_clusters = 12, silhouette score is 0.12883008320005324 For n_clusters = 13, silhouette score is 0.1744804713048242 For n_clusters = 14, silhouette score is 0.17457839075773943
[<matplotlib.lines.Line2D at 0x78df0e5b1900>]
#plot elbow using new metric
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(km_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='silhouette score'>
# checking cluster=2 with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(2, random_state=1))
visualizer.fit(km_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 2 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# checking cluster=3 with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(km_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# checking cluster=4 with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(km_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# checking cluster=5 with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(km_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# checking cluster=8 with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(8, random_state=1))
visualizer.fit(km_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 8 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# checking cluster=13 with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(13, random_state=1))
visualizer.fit(km_df)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 13 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
#state and fit model
kmean_model = KMeans(n_clusters=3, random_state=1)
kmean_model.fit(km_df)
KMeans(n_clusters=3, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
KMeans(n_clusters=3, random_state=1)
#copy original data
df_finalkm = df.copy()
#add the cluster labels to the original data
df_finalkm['Cluster Group'] = kmean_model.labels_
#add the cluster labels to the scaled data
subset_scaled_frame['Cluster Group'] = kmean_model.labels_
#view original data
df_finalkm.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Cluster Group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 2 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 2 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 | 2 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 | 2 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 | 0 |
#group by cluster categories and get mean values of numeric data
cluster_profile = df_finalkm.groupby('Cluster Group').mean(numeric_only=True)
#count members of each cluster
cluster_profile['Members Count']= (df_finalkm.groupby('Cluster Group')['Ticker Symbol'].count().values)
#print companies in each cluster
for cl in df_finalkm["Cluster Group"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df_finalkm[df_finalkm["Cluster Group"] == cl]["Security"].unique())
print()
In cluster 2, the following companies are present: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Alliance Data Systems' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Amgen Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'Celgene Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.' 'CME Group Inc.' 'Chipotle Mexican Grill' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equinix' 'Equity Residential' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Edwards Lifesciences' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'First Solar Inc' 'Frontier Communications' 'General Dynamics' 'General Growth Properties Inc.' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Intuitive Surgical Inc.' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kimberly-Clark' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Monster Beverage' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Priceline.com Inc' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Regeneron' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Spectra Energy Corp.' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Vertex Pharmaceuticals Inc' 'Ventas Inc' 'Waters Corporation' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yahoo Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] In cluster 0, the following companies are present: ['Analog Devices, Inc.' 'Alexion Pharmaceuticals' 'Amazon.com Inc' 'Apache Corporation' 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Chesapeake Energy' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'EQT Corporation' 'Freeport-McMoran Cp & Gld' 'Halliburton Co.' 'Hess Corporation' 'Hewlett Packard Enterprise' 'Kinder Morgan' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Netflix Inc.' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Wynn Resorts Ltd' 'Cimarex Energy'] In cluster 1, the following companies are present: ['Bank of America Corp' 'Citigroup Inc.' 'Ford Motor' 'Facebook' 'Gilead Sciences' 'Intel Corp.' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Merck & Co.' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.']
#display the profile
cluster_profile.style.highlight_max(color='yellow', axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Members Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cluster Group | ||||||||||||
| 0 | 64.183438 | -10.557046 | 2.797776 | 96.531250 | 70.718750 | 159171125.000000 | -3250005968.750000 | -7.886875 | 526459323.057500 | 111.333230 | 1.783445 | 32 |
| 1 | 52.142857 | 6.779993 | 1.175153 | 26.142857 | 140.142857 | 760285714.285714 | 13368785714.285715 | 3.769286 | 3838879870.871428 | 20.654832 | -3.529270 | 14 |
| 2 | 84.045331 | 5.542488 | 1.404255 | 34.040816 | 66.608844 | 10698350.340136 | 1445333183.673469 | 3.890051 | 427206184.715408 | 24.613743 | -2.013147 | 294 |
#view sectors in clusters
df_finalkm.groupby(["Cluster Group", "GICS Sector"])['Security'].count()
| Security | ||
|---|---|---|
| Cluster Group | GICS Sector | |
| 0 | Consumer Discretionary | 2 |
| Energy | 23 | |
| Health Care | 1 | |
| Industrials | 1 | |
| Information Technology | 4 | |
| Materials | 1 | |
| 1 | Consumer Discretionary | 1 |
| Consumer Staples | 1 | |
| Energy | 1 | |
| Financials | 4 | |
| Health Care | 3 | |
| Information Technology | 2 | |
| Telecommunications Services | 2 | |
| 2 | Consumer Discretionary | 37 |
| Consumer Staples | 18 | |
| Energy | 6 | |
| Financials | 45 | |
| Health Care | 36 | |
| Industrials | 52 | |
| Information Technology | 27 | |
| Materials | 19 | |
| Real Estate | 27 | |
| Telecommunications Services | 3 | |
| Utilities | 24 |
#boxplots of numerical attributes
#separate by cluster
plt.figure(figsize=(15, 10))
plt.suptitle("Boxplots of Numerical Attributes by Cluster")
for i, variable in enumerate(num_cols):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df_finalkm, x="Cluster Group", y=variable)
plt.tight_layout(pad=2.0)
Insights Regarding Cluster Characteristics:
#copy the scaled data
harchy_df = subset_scaled_frame.copy()
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(harchy_df, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(harchy_df))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9258232807603518. Cophenetic correlation for Euclidean distance and complete linkage is 0.8702339998778295. Cophenetic correlation for Euclidean distance and average linkage is 0.9429573458495599. Cophenetic correlation for Euclidean distance and weighted linkage is 0.8659146797006966. Cophenetic correlation for Chebyshev distance and single linkage is 0.9095824236261919. Cophenetic correlation for Chebyshev distance and complete linkage is 0.8171563833185052. Cophenetic correlation for Chebyshev distance and average linkage is 0.9324662171370572. Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9103186591079586. Cophenetic correlation for Mahalanobis distance and single linkage is 0.9256950378508784. Cophenetic correlation for Mahalanobis distance and complete linkage is 0.8273180174929547. Cophenetic correlation for Mahalanobis distance and average linkage is 0.9352147105001555. Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8873052895448271. Cophenetic correlation for Cityblock distance and single linkage is 0.9349020977687301. Cophenetic correlation for Cityblock distance and complete linkage is 0.8094729156295056. Cophenetic correlation for Cityblock distance and average linkage is 0.9076079763586995. Cophenetic correlation for Cityblock distance and weighted linkage is 0.6642200849989767.
# print distance metric combined with linkage method that output the highest CC
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.9429573458495599, which is obtained with Euclidean distance and average linkage.
#Check linkage Dendrograms for Euclidean Distance
#highest CC score was with Euclidean and average link
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
#enumerate through the list of linkage methods
# For each linkage method, plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(harchy_df, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(harchy_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
#euclidean average hierarchy model
HCmodel = AgglomerativeClustering(n_clusters=6, metric="euclidean", linkage="average")
HCmodel.fit(harchy_df)
AgglomerativeClustering(linkage='average', metric='euclidean', n_clusters=6)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AgglomerativeClustering(linkage='average', metric='euclidean', n_clusters=6)
#copy original data
df_euclidean_harchy = df.copy()
#add the cluster labels to the original data
df_euclidean_harchy['E Cluster Group'] = HCmodel.labels_
#add the cluster labels to the scaled data
subset_scaled_frame['E Cluster Group'] = HCmodel.labels_
#group by cluster categories and get mean values of numeric data
HCcluster_profile = df_euclidean_harchy.groupby('E Cluster Group').mean(numeric_only=True)
HCcluster_profile["Member Count"] = (
df_euclidean_harchy.groupby("E Cluster Group")["Ticker Symbol"].count().values)
HCcluster_profile.style.highlight_max(color="yellow", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | Member Count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E Cluster Group | ||||||||||||
| 0 | 76.088430 | 4.021415 | 1.517035 | 35.414414 | 66.801802 | -36927396.396396 | 1538076291.291291 | 2.908183 | 559307820.632432 | 30.949218 | -1.799065 | 333 |
| 1 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1292500000.000000 | -19106500000.000000 | -41.815000 | 519573983.250000 | 60.748608 | 1.565141 | 2 |
| 2 | 25.640000 | 11.237908 | 1.322355 | 12.500000 | 130.500000 | 16755500000.000000 | 13654000000.000000 | 3.295000 | 2791829362.100000 | 13.649696 | 1.508484 | 2 |
| 3 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1671386000.000000 | 2551360000.000000 | 50.090000 | 50935516.070000 | 25.453183 | -1.052429 | 1 |
| 4 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 592000000.000000 | 3669000000.000000 | 1.310000 | 2800763359.000000 | 79.893133 | 5.884467 | 1 |
| 5 | 675.890015 | 32.268105 | 1.460386 | 4.000000 | 58.000000 | 1333000000.000000 | 596000000.000000 | 1.280000 | 465625000.000000 | 528.039074 | 3.904430 | 1 |
#print companies in each cluster
for cl in df_euclidean_harchy["E Cluster Group"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df_finalkm[df_euclidean_harchy["E Cluster Group"] == cl]["Security"].unique())
print()
In cluster 0, the following companies are present: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co' 'Alliance Data Systems' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion' 'Alexion Pharmaceuticals' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Amgen Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Anadarko Petroleum Corp' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'Baker Hughes Inc' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Citigroup Inc.' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'Celgene Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.' 'CME Group Inc.' 'Chipotle Mexican Grill' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'Cabot Oil & Gas' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Concho Resources' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'Devon Energy Corp.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'EOG Resources' 'Equinix' 'Equity Residential' 'EQT Corporation' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Edwards Lifesciences' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Ford Motor' 'Fastenal Co' 'Fortune Brands Home & Security' 'Freeport-McMoran Cp & Gld' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'First Solar Inc' 'Frontier Communications' 'General Dynamics' 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hess Corporation' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'Hewlett Packard Enterprise' 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Intuitive Surgical Inc.' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'JPMorgan Chase & Co.' 'Kimco Realty' 'Kimberly-Clark' 'Kinder Morgan' 'Coca Cola Company' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Monster Beverage' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.' 'Marathon Oil Corp.' 'M&T Bank Corp.' 'Mettler Toledo' 'Murphy Oil' 'Mylan N.V.' 'Navient' 'Noble Energy Inc' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Netflix Inc.' 'Newfield Exploration Co' 'Nielsen Holdings' 'National Oilwell Varco Inc.' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'ONEOK' 'Omnicom Group' "O'Reilly Automotive" 'Occidental Petroleum' "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Pfizer Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Regeneron' 'Robert Half International' 'Roper Industries' 'Range Resources Corp.' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Spectra Energy Corp.' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Southwestern Energy' 'Synchrony Financial' 'Stryker Corp.' 'AT&T Inc' 'Molson Coors Brewing Company' 'Teradata Corp.' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Vertex Pharmaceuticals Inc' 'Ventas Inc' 'Verizon Communications' 'Waters Corporation' 'Wec Energy Group Inc' 'Wells Fargo' 'Whirlpool Corp.' 'Waste Management Inc.' 'Williams Cos.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Wynn Resorts Ltd' 'Cimarex Energy' 'Xcel Energy Inc' 'XL Capital' 'Exxon Mobil Corp.' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yahoo Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] In cluster 5, the following companies are present: ['Amazon.com Inc'] In cluster 1, the following companies are present: ['Apache Corporation' 'Chesapeake Energy'] In cluster 2, the following companies are present: ['Bank of America Corp' 'Intel Corp.'] In cluster 4, the following companies are present: ['Facebook'] In cluster 3, the following companies are present: ['Priceline.com Inc']
df_euclidean_harchy.groupby(["E Cluster Group", "GICS Sector"])['Security'].count()
| Security | ||
|---|---|---|
| E Cluster Group | GICS Sector | |
| 0 | Consumer Discretionary | 38 |
| Consumer Staples | 19 | |
| Energy | 28 | |
| Financials | 48 | |
| Health Care | 40 | |
| Industrials | 53 | |
| Information Technology | 31 | |
| Materials | 20 | |
| Real Estate | 27 | |
| Telecommunications Services | 5 | |
| Utilities | 24 | |
| 1 | Energy | 2 |
| 2 | Financials | 1 |
| Information Technology | 1 | |
| 3 | Consumer Discretionary | 1 |
| 4 | Information Technology | 1 |
| 5 | Consumer Discretionary | 1 |
#boxplots of numerical attributes
#separate by cluster
plt.figure(figsize=(15, 10))
plt.suptitle("Boxplots of Numerical Attributes by Cluster")
for i, variable in enumerate(num_cols):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df_euclidean_harchy, x="E Cluster Group", y=variable)
plt.tight_layout(pad=2.0)
You compare several things, like:
You can also mention any differences or similarities you obtained in the cluster profiles from both the clustering techniques.